Group members: Bryan D’Amico, Gustavo Gyotoku, Mackenna McCosh, Sachi Singh
The purpose of this report is to record the data manipulation and analysis procedures that were conducted while preparing the presentation on the Spotify Most Popular Songs data set. The goal of the analysis is to give insight into what kind of artist should be newly signed to the label and what kind of song they should produce to have the best chance of releasing the next highly popular song on Spotify.
We will begin by introducing the definitions of the variables for this data set. These definitions were taken from the Kaggle page for the data set which was uploaded by Mark Koverha. The page for the data set can be found at the following URL: https://www.kaggle.com/datasets/paradisejoy/top-hits-spotify-from-20002019
The explanations in this paper assume the reader is familiar with the definitions for each variable. Also, please note that although the title of the data set on the Kaggle website says 2000 to 2019, the data ranges from 1998 to 2020.
Before we began any analyses on the data, we loaded all the packages we were going to use. Next, we set a seed so the tests that involve randomness could be reproduced. Then we read in the data set and, after inspecting it, realized there were duplicate songs and songs with a popularity rating of 0. These were removed from the data set. The remaining data (1,762 rows by 19 columns) still exceeded the requirement of over 10,000 values.
#Loading packages for data manipulation and analysis.
library(tidyverse)
library(rMIDAS)
library(PerformanceAnalytics)
library(readr)
library(BEST)
library(caret)
#Some tests (such as the Bayesian BESTmcmc models) involve elements of randomness. To reproduce the results, the user should set the same seed.
set.seed(12345)
#Reading in the csv data set and saving it as a data frame.
dfSpotify <- read_csv("~/Desktop/Intro Data Science/songs_normalize.csv")
## Rows: 2000 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): artist, song, genre
## dbl (14): duration_ms, year, popularity, danceability, energy, key, loudness...
## lgl (1): explicit
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#While exploring the data, we noticed there were duplicated songs and songs with a popularity rating of 0. These were removed. Also, we felt that measuring the length of a song in minutes was more user friendly than in milliseconds.
dfSpotifyPopular <- dfSpotify %>%
filter(popularity != 0) %>%
filter(duplicated(song) == FALSE) %>%
mutate(duration_min = duration_ms / 1000 /60)
#Confirming that the modified data set still meets the project criteria.
dim(dfSpotifyPopular)
## [1] 1762 19
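One point worth checking in the filter above is that duplicated(song) compares titles only, so two different songs that share a title but come from different artists would be treated as duplicates. The following is a minimal sketch using made-up toy rows (not taken from the Spotify data) that contrasts the title-only check with a check on artist and title together. Whether this matters for the actual data set is an assumption that would need to be verified.

```r
# Toy data: hypothetical rows for illustration only.
toy <- data.frame(
  artist     = c("Artist A", "Artist A", "Artist B"),
  song       = c("Hello", "Hello", "Hello"),
  popularity = c(70, 70, 65)
)

# Title-only check, as in the filter above: flags rows 2 and 3.
dup_title <- duplicated(toy$song)

# Artist-and-title check: flags only row 2, the true repeat.
dup_pair <- duplicated(toy[, c("artist", "song")])

toy[!dup_title, ]  # keeps one row
toy[!dup_pair, ]   # keeps two rows
```

If the stricter check were preferred, the same idea could be expressed in the pipeline by testing the artist and song columns together.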
We started by creating line plots showing the change in the median values of the songs by year using the main numeric metrics in the data set. We hoped to see positive, negative, or flat trends that could explain how people’s taste in music is changing or staying the same over time.
Liveness trended upwards for some time but returned to a low level during the most recent years in the data set. Higher tempo songs are currently preferred over slower songs. The overall valence of popular songs, which is a measure of how happy the song sounds, has decreased over time. Speechiness, which is a measure of how much of the song sounds like spoken word, decreased between 2008 and 2014 but has recently risen above its previous levels. In recent years, the acousticness, or how acoustic a song sounds, has increased. We can also see that at no point in time have the most popular songs been instrumental, or lacking in vocals. Danceability showed a U-shaped plot, with the rating for the most popular songs dipping from 2005 to 2010 before returning to a higher value in recent years. Recently, the energy, or intensity of the song, has decreased among the most popular songs. The length of the most popular songs has seen a steady decline over time. Lastly, with the exception of a spike during the middle years of the data set, the loudness of the most popular songs has stayed fairly constant. Please note that our analysis of this data focused on trends and not on the actual values of the metrics. For example, although we discussed the change in speechiness, the y-axis of the line plot shows that the median speechiness has remained low over the entire duration of the data set.
#To visualize the changes in each of the recorded song metrics over time, a series of line plots was made. All of these line plots show the change in the median value of the metric over time. First, we filtered the data set to only include songs from the years 1999 to 2019 because there is only one song in the list from 1998 and one song from 2020. The group_by() and summarize() functions are used to calculate the median value of the metric for each year.
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_liveness = median(liveness)) %>%
ggplot(aes(x = year, y = median_liveness)) +
geom_line() +
ggtitle("Median Liveness of Popular Songs Over Time") +
labs(x = "Year", y = "Median Liveness")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_tempo = median(tempo)) %>%
ggplot(aes(x = year, y = median_tempo)) +
geom_line() +
ggtitle("Median Tempo of Popular Songs Over Time") +
labs(x = "Year", y = "Median Tempo")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_valence = median(valence)) %>%
ggplot(aes(x = year, y = median_valence)) +
geom_line() +
ggtitle("Median Valence (Happiness) of Popular Songs Over Time") +
labs(x = "Year", y = "Median Valence")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_speechiness = median(speechiness)) %>%
ggplot(aes(x = year, y = median_speechiness)) +
geom_line() +
ggtitle("Median Presence of Spoken Word Attributes of Popular Songs Over Time") +
labs(x = "Year", y = "Median Speechiness")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_acousticness = median(acousticness)) %>%
ggplot(aes(x = year, y = median_acousticness)) +
geom_line() +
ggtitle("Median Acousticness of Popular Songs Over Time") +
labs(x = "Year", y = "Median Acousticness")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_instrumentalness = median(instrumentalness)) %>%
ggplot(aes(x = year, y = median_instrumentalness)) +
geom_line() +
ggtitle("Median Lack of Vocals of Popular Songs Over Time") +
labs(x = "Year", y = "Median Instrumentalness")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_danceability = median(danceability)) %>%
ggplot(aes(x = year, y = median_danceability)) +
geom_line() +
ggtitle("Median Danceability of Popular Songs Over Time") +
labs(x = "Year", y = "Median Danceability")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_energy = median(energy)) %>%
ggplot(aes(x = year, y = median_energy)) +
geom_line() +
ggtitle("Median Intensity of Popular Songs Over Time") +
labs(x = "Year", y = "Median Energy")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_duration_min = median(duration_min)) %>%
ggplot(aes(x = year, y = median_duration_min)) +
geom_line() +
ggtitle("Median Length in Minutes of Popular Songs Over Time") +
labs(x = "Year", y = "Median Duration in Minutes")
dfSpotifyPopular %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_loudness = median(loudness)) %>%
ggplot(aes(x = year, y = median_loudness)) +
geom_line() +
ggtitle("Median Loudness of Popular Songs Over Time") +
labs(x = "Year", y = "Median Loudness")
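The ten plots above all follow the same filter, group, summarize, and plot pattern. As a sketch of a more compact alternative, the metrics can be reshaped into long format with pivot_longer() and drawn as one faceted figure. The toy data frame below is simulated and only stands in for dfSpotifyPopular, and the three metrics shown are an arbitrary subset chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Simulated stand-in for dfSpotifyPopular (random values, illustration only).
set.seed(1)
toy <- data.frame(
  year     = rep(1999:2019, each = 5),
  liveness = runif(105),
  tempo    = runif(105, 60, 180),
  valence  = runif(105)
)

# One row per (metric, year) holding the median value of that metric.
medians_by_year <- toy %>%
  filter(year %in% 1999:2019) %>%
  pivot_longer(cols = c(liveness, tempo, valence),
               names_to = "metric", values_to = "value") %>%
  group_by(metric, year) %>%
  summarize(median_value = median(value), .groups = "drop")

# One panel per metric; free y scales because the metrics have different units.
p <- ggplot(medians_by_year, aes(x = year, y = median_value)) +
  geom_line() +
  facet_wrap(~ metric, scales = "free_y") +
  labs(x = "Year", y = "Median Value",
       title = "Median Metrics of Popular Songs Over Time")
p
```

The trade-off is that a single faceted figure loses the individual titles used above, but it keeps every panel on the same x-axis and avoids repeating the pipeline.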
After examining the line plots and taking note of the trends, we separated the main data set into multiple data sets broken up by approximately half decades.
#Creating separate data sets breaking up the data into approximately 5 year increments to further examine the changes in song metrics over time.
dfSpotify2005 <- dfSpotifyPopular %>%
filter(year <= 2005) %>%
filter(year >= 2000)
dfSpotify2010 <- dfSpotifyPopular %>%
filter(year <= 2010) %>%
filter(year > 2005)
dfSpotify2015 <- dfSpotifyPopular %>%
filter(year <= 2015) %>%
filter(year > 2010)
dfSpotify2020 <- dfSpotifyPopular %>%
filter(year <= 2020) %>%
filter(year > 2015)
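The same four periods can also be expressed as a single grouping variable instead of four separate data frames. The sketch below uses cut() on a hypothetical vector of release years; the labels are our own naming and not part of the data set. Note that the first period, 2000 to 2005, spans six years while the others span five, and that years before 2000 fall outside every interval and become NA.

```r
# Hypothetical release years for illustration.
years <- c(2000, 2005, 2006, 2010, 2011, 2015, 2016, 2020)

# cut() uses right-closed intervals by default:
# (1999,2005], (2005,2010], (2010,2015], (2015,2020].
period <- cut(years,
              breaks = c(1999, 2005, 2010, 2015, 2020),
              labels = c("2000-2005", "2006-2010", "2011-2015", "2016-2020"))

table(period)

# Years outside the breaks are not assigned to any period.
is.na(cut(1998, breaks = c(1999, 2005, 2010, 2015, 2020)))
```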
This was done in preparation for running a series of t-tests on the data to see if there have been significant changes among some of the metrics of the most popular songs over time. We focused on variables that appeared to change over time based on the line plots. First we investigated the duration of the songs. Industry research claims that the increasing popularity of streaming services, such as Spotify, has led to people preferring shorter songs. The following t-tests are an attempt to answer questions such as “Do popular songs tend to be shorter as time moves forward?” For all t-tests we will use an alpha threshold of 0.05 to determine statistical significance.
#t-test comparing the mean length of songs from 2006 to 2010 with the mean length of songs from 2000 to 2005.
t.test(dfSpotify2010$duration_min, dfSpotify2005$duration_min)
##
## Welch Two Sample t-test
##
## data: dfSpotify2010$duration_min and dfSpotify2005$duration_min
## t = -3.8066, df = 925.92, p-value = 0.0001502
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.24302448 -0.07768147
## sample estimates:
## mean of x mean of y
## 3.867819 4.028172
When performing the t-test, the sample means are subtracted in the order in which the samples are given to the function, so a negative difference means the songs from 2006 to 2010 are shorter and a positive difference means the songs from 2000 to 2005 are shorter. The results of this test are statistically significant with a p-value of less than 0.05. We reject the null hypothesis that there is no difference between the mean durations of popular songs from 2006 to 2010 and 2000 to 2005. Since the 95% confidence interval spans only negative values and does not overlap with 0, this evidence suggests that popular songs have gotten shorter.
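To illustrate how the order of the arguments determines the sign of the result, the following sketch runs t.test() on simulated durations (not the Spotify data) in both orders. The variable names newer and older are hypothetical.

```r
set.seed(1)
# Simulated song durations in minutes, for illustration only.
newer <- rnorm(200, mean = 3.8, sd = 0.6)
older <- rnorm(200, mean = 4.0, sd = 0.6)

res         <- t.test(newer, older)  # estimates mean(newer) - mean(older)
res_swapped <- t.test(older, newer)  # estimates mean(older) - mean(newer)

# Difference in sample means, first sample minus second sample.
diff_means <- unname(res$estimate[1] - res$estimate[2])

res$conf.int          # interval for mean(newer) - mean(older)
res_swapped$conf.int  # the same interval with the signs flipped
```

Swapping the arguments changes only the sign of the t statistic and the confidence interval; the p-value of the two-sided test is the same in both cases.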
#t-test comparing the mean length of songs from 2011 to 2015 with the mean length of songs from 2006 to 2010.
t.test(dfSpotify2015$duration_min, dfSpotify2010$duration_min)
##
## Welch Two Sample t-test
##
## data: dfSpotify2015$duration_min and dfSpotify2010$duration_min
## t = -2.4992, df = 839.24, p-value = 0.01264
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.18720487 -0.02250627
## sample estimates:
## mean of x mean of y
## 3.762963 3.867819
The results of this test are statistically significant with a p-value of less than 0.05. We reject the null hypothesis that there is no credible difference between the mean durations of popular songs from 2011 to 2015 and 2006 to 2010. Since the 95% confidence interval spans only negative values and does not overlap with 0, this evidence suggests that popular songs have gotten shorter.
#t-test comparing the mean length of songs from 2016 to 2020 with the mean length of songs from 2011 to 2015.
t.test(dfSpotify2020$duration_min, dfSpotify2015$duration_min)
##
## Welch Two Sample t-test
##
## data: dfSpotify2020$duration_min and dfSpotify2015$duration_min
## t = -6.3804, df = 706.06, p-value = 3.196e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.3688379 -0.1952582
## sample estimates:
## mean of x mean of y
## 3.480915 3.762963
The results of this test are statistically significant with a p-value of less than 0.05. We reject the null hypothesis that there is no credible difference between the mean durations of popular songs from 2016 to 2020 and 2011 to 2015. Since the 95% confidence interval spans only negative values and does not overlap with 0, this evidence suggests that popular songs have gotten shorter.
#t-test comparing the mean length of songs from 2016 to 2020 with the mean length of songs from 2000 to 2005.
t.test(dfSpotify2020$duration_min, dfSpotify2005$duration_min)
##
## Welch Two Sample t-test
##
## data: dfSpotify2020$duration_min and dfSpotify2005$duration_min
## t = -12.335, df = 757.18, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6343529 -0.4601602
## sample estimates:
## mean of x mean of y
## 3.480915 4.028172
It makes sense that when comparing the durations of songs from 2000 to 2005 with the durations of songs from 2016 to 2020, we see a similar result to the previous tests. Again, we reject the null hypothesis that there is no credible difference between the mean durations of popular songs from 2016 to 2020 and 2000 to 2005. Since the 95% confidence interval spans only negative values and does not overlap with 0, this evidence suggests that popular songs have gotten shorter.
To accompany our frequentist analysis using t-tests, we decided to also perform a Bayesian analysis. The following creates a distribution of posterior probabilities for estimating the mean length of popular songs released between 2000 and 2005 compared to those released between 2016 and 2020.
#Saving the BESTmcmc process to display the plot of the distribution of posterior probabilities.
songBestDuration <- BESTmcmc(dfSpotify2020$duration_min, dfSpotify2005$duration_min)
## Waiting for parallel processing to complete...done.
plot(songBestDuration)
We see that 100% of our mean difference estimates are negative. This is overwhelming evidence that the mean duration of songs from 2000 to 2005 is longer than the mean duration of those from 2016 to 2020. The most likely estimate for the difference in length is about one half of a minute.
Based on all of the evidence, we are comfortable in claiming that the most popular songs in recent years are shorter than they were in the past.
Next, we investigated the change over time for valence, or how happy the song sounds. For this paper, we have chosen to show just the t-test comparing the valence from 2000 to 2005 with 2016 to 2020. The code for the other t-tests has been provided and commented out.
#t.test(dfSpotify2010$valence, dfSpotify2005$valence)
#t.test(dfSpotify2015$valence, dfSpotify2010$valence)
#t.test(dfSpotify2020$valence, dfSpotify2015$valence)
t.test(dfSpotify2020$valence, dfSpotify2005$valence)
##
## Welch Two Sample t-test
##
## data: dfSpotify2020$valence and dfSpotify2005$valence
## t = -8.0073, df = 759.22, p-value = 4.4e-15
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.14580344 -0.08838813
## sample estimates:
## mean of x mean of y
## 0.4954733 0.6125690
Based on the p-value we can reject the null hypothesis. This test suggests that the valence of popular songs decreased since the 95% confidence interval is entirely negative.
The following is the Bayesian analysis of valence.
#Saving the BESTmcmc process to display the plot of the distribution of posterior probabilities.
songBestValence <- BESTmcmc(dfSpotify2020$valence, dfSpotify2005$valence)
## Waiting for parallel processing to complete...done.
plot(songBestValence)
We see that 100% of the distribution is less than 0, since the difference is calculated as the 2016 to 2020 mean minus the 2000 to 2005 mean. This gives us very strong evidence that popular songs between 2000 and 2005 were happier than songs released between 2016 and 2020. The most likely estimate for the difference in mean valence is a little more than one tenth.
We will also investigate the change in acousticness over time. Like the previous set of t-tests, we will display the results for the comparison of 2000 to 2005 with 2016 to 2020, with the other tests commented out.
#t.test(dfSpotify2010$acousticness, dfSpotify2005$acousticness)
#t.test(dfSpotify2015$acousticness, dfSpotify2010$acousticness)
#t.test(dfSpotify2020$acousticness, dfSpotify2015$acousticness)
t.test(dfSpotify2020$acousticness, dfSpotify2005$acousticness)
##
## Welch Two Sample t-test
##
## data: dfSpotify2020$acousticness and dfSpotify2005$acousticness
## t = 2.3658, df = 691.95, p-value = 0.01827
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.005112354 0.055005361
## sample estimates:
## mean of x mean of y
## 0.1676376 0.1375787
Based on the p-value we can reject the null hypothesis. This test suggests that the acousticness of popular songs increased since the 95% confidence interval is entirely positive.
The following is the Bayesian analysis of acousticness.
#Saving the BESTmcmc process to display the plot of the distribution of posterior probabilities.
songBestAcoustic <- BESTmcmc(dfSpotify2020$acousticness, dfSpotify2005$acousticness)
## Waiting for parallel processing to complete...done.
plot(songBestAcoustic)
We can see from the distribution of posterior probabilities there is a
small but certain increase in the mean acousticness of songs over time,
with the most likely estimate of that increase being 0.0358.
Lastly, we will take a look at energy. Again we will show the results of the t-test comparing the mean energy of the songs released between 2016 and 2020 with those released between 2000 and 2005. The other t-tests are commented out.
#t.test(dfSpotify2010$energy, dfSpotify2005$energy)
#t.test(dfSpotify2015$energy, dfSpotify2010$energy)
#t.test(dfSpotify2020$energy, dfSpotify2015$energy)
t.test(dfSpotify2020$energy, dfSpotify2005$energy)
##
## Welch Two Sample t-test
##
## data: dfSpotify2020$energy and dfSpotify2005$energy
## t = -5.2709, df = 762.16, p-value = 1.769e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07567275 -0.03460211
## sample estimates:
## mean of x mean of y
## 0.6694913 0.7246287
Based on the p-value we can reject the null hypothesis that there is no difference in the mean values for energy. Since the 95% confidence interval is entirely negative, we have evidence that suggests that the energy of the most popular songs has decreased over time.
The following is the Bayesian analysis of energy.
#Saving the BESTmcmc process to display the plot of the distribution of posterior probabilities.
songBestEnergy <- BESTmcmc(dfSpotify2020$energy, dfSpotify2005$energy)
## Waiting for parallel processing to complete...done.
plot(songBestEnergy)
This time we see a small but certain decrease in the mean energy of the
most popular songs over time, with the most likely estimate of that
decrease being 0.0563.
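The comparisons for valence, acousticness, and energy all repeat the same t-test between the earliest and latest periods. As a sketch of how the results could be collected into one table, the code below loops over the metric names and stores the estimated difference, confidence interval, and p-value for each. The data frames early and late are simulated stand-ins for dfSpotify2005 and dfSpotify2020, so the numbers they produce are not results from this report.

```r
set.seed(1)
# Simulated stand-ins for dfSpotify2005 (early) and dfSpotify2020 (late).
early <- data.frame(valence = runif(100), acousticness = runif(100), energy = runif(100))
late  <- data.frame(valence = runif(100), acousticness = runif(100), energy = runif(100))

metrics <- c("valence", "acousticness", "energy")

# Run one Welch t-test per metric and keep the key numbers in a data frame.
results <- do.call(rbind, lapply(metrics, function(m) {
  res <- t.test(late[[m]], early[[m]])
  data.frame(
    metric  = m,
    diff    = unname(res$estimate[1] - res$estimate[2]),  # late minus early
    ci_low  = res$conf.int[1],
    ci_high = res$conf.int[2],
    p_value = res$p.value
  )
}))

results
```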
Next, we broke up the main data set into multiple data sets by major genre. The genres chosen were rock, hip hop, R&B, country, dance, and pop.
#The grepl() function was used to filter because many songs had multiple genres listed. So for example, any song that has rock listed as one of its genres is sent to the rock data frame. If a song belongs to two major genres such as pop-rock, then it will appear in the rock data frame and the pop data frame.
dfRock <- dfSpotifyPopular %>%
filter(grepl("rock", genre))
dfPop <- dfSpotifyPopular %>%
filter(grepl("pop", genre))
dfHipHop <- dfSpotifyPopular %>%
filter(grepl("hip hop", genre))
dfCountry <- dfSpotifyPopular %>%
filter(grepl("country", genre))
dfDance <- dfSpotifyPopular %>%
filter(grepl("Dance", genre))
dfRB <- dfSpotifyPopular %>%
filter(grepl("R&B", genre))
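Because grepl() matches patterns case sensitively by default, the capitalization of each pattern above has to match the way the genre is written in the data. The sketch below shows this behavior on a few hypothetical genre strings written in the style of the data set; the exact strings are an assumption for illustration.

```r
# Hypothetical genre strings for illustration.
genre <- c("pop, rock", "Dance/Electronic", "hip hop, pop, R&B", "country")

# A song listing several genres matches each of them.
is_rock <- grepl("rock", genre)
is_pop  <- grepl("pop", genre)

# Lowercase "dance" does not match "Dance/Electronic" unless case is ignored.
is_dance_strict <- grepl("dance", genre)
is_dance_loose  <- grepl("dance", genre, ignore.case = TRUE)

is_rock          # TRUE FALSE FALSE FALSE
is_pop           # TRUE FALSE TRUE FALSE
is_dance_strict  # FALSE FALSE FALSE FALSE
is_dance_loose   # FALSE TRUE FALSE FALSE
```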
To explore the question “Is one genre more popular than the others?”, a series of box plots was created showing the distribution of popularity ratings within each genre. A summary() of each was also run to show the exact numbers represented in the box plots.
#Each box plot shows the distribution of popularity for the given genre. The summary() function is called afterwards so the reader can see exactly what numbers the box plot is representing.
dfRock %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for Rock Songs")
summary(dfRock$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 61.00 68.50 65.19 76.00 89.00
dfPop %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for Pop Songs")
summary(dfPop$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 58.00 66.00 63.22 73.00 89.00
dfHipHop %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for Hip Hop Songs")
summary(dfHipHop$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 59.00 67.00 64.62 74.00 87.00
dfCountry %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for Country Songs")
summary(dfCountry$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 50.00 57.75 63.00 62.44 66.75 76.00
dfDance %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for Dance Songs")
summary(dfDance$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 55.00 64.00 61.94 73.00 86.00
dfRB %>%
ggplot(aes(x = popularity))+
geom_boxplot() +
coord_flip() +
ggtitle("Distribution of Popularity Ratings for R&B Songs")
summary(dfRB$popularity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 57.00 64.00 61.46 70.00 86.00
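For a direct visual comparison, the genre data frames can also be stacked and drawn as side-by-side box plots in a single figure. The sketch below uses two simulated data frames, toyRock and toyPop, as hypothetical stand-ins for dfRock and dfPop. A song that appears in more than one genre data frame would appear once per genre in the stacked data.

```r
library(dplyr)
library(ggplot2)

set.seed(1)
# Simulated stand-ins for two of the genre data frames.
toyRock <- data.frame(popularity = round(runif(30, 40, 90)))
toyPop  <- data.frame(popularity = round(runif(50, 40, 90)))

# bind_rows() with a named list and .id adds a column recording the source.
dfGenres <- bind_rows(list(Rock = toyRock, Pop = toyPop), .id = "major_genre")

# One box per genre on a shared popularity axis.
p <- ggplot(dfGenres, aes(x = major_genre, y = popularity)) +
  geom_boxplot() +
  labs(x = "Genre", y = "Popularity",
       title = "Distribution of Popularity Ratings by Genre")
p
```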
It appears that rock may be the most popular genre. To test this, we created another data frame that contained every popular song not in the rock genre. This allows us to perform a t-test comparing the mean popularity of rock songs vs. the mean popularity of all other songs. Note that, unlike before, there is no overlap of songs in the two data sets. For a song to qualify for the “not rock” data set it must not have rock listed for any of its genres.
#Creating a data frame that contains every song that is not in the rock genre. To do this we filtered using the grepl() function again, only this time we negated it to get all the songs where the genre data did not contain the word rock.
dfNoRock <- dfSpotifyPopular %>%
filter(!grepl("rock", genre))
Now we will perform a t-test to see if there is a credible difference between the mean popularity ratings of all the songs in the rock genre vs those songs not in the rock genre.
t.test(dfRock$popularity, dfNoRock$popularity)
##
## Welch Two Sample t-test
##
## data: dfRock$popularity and dfNoRock$popularity
## t = 1.4581, df = 263.57, p-value = 0.146
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6110388 4.0988675
## sample estimates:
## mean of x mean of y
## 65.19159 63.44767
The results of the t-test are not significant, so we fail to reject the null hypothesis that the true difference in mean popularity of rock songs vs. all other songs is equal to 0. We did note, however, that most of the 95% confidence interval lies above 0, which leans toward rock being more popular.
We also ran a Bayesian BESTmcmc process to analyze the difference of means.
rockPopularityBEST <- BESTmcmc(dfRock$popularity, dfNoRock$popularity)
## Waiting for parallel processing to complete...done.
plot(rockPopularityBEST)
This analysis tells us that 100% of the posterior distribution for the difference in means was positive, with a most likely estimate for the mean difference in popularity of 2.95 points in favor of rock songs.
While the two results conflict with one another, we feel that the Bayesian output is convincing enough to move forward under the assumption that rock songs are on average more popular among the most popular songs on Spotify.
Since we concluded that rock is, on average, the most popular genre, we decided to focus on that genre moving forward.
Next, we investigated the most popular songs in the rock genre to see if these songs were following the same trends over time that we saw in the overall data set including all genres.
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_liveness = median(liveness)) %>%
ggplot(aes(x = year, y = median_liveness)) +
geom_line() +
ggtitle("Median Liveness of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Liveness")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_tempo = median(tempo)) %>%
ggplot(aes(x = year, y = median_tempo)) +
geom_line() +
ggtitle("Median Tempo of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Tempo")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_valence = median(valence)) %>%
ggplot(aes(x = year, y = median_valence)) +
geom_line() +
ggtitle("Median Valence (Happiness) of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Valence")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_speechiness = median(speechiness)) %>%
ggplot(aes(x = year, y = median_speechiness)) +
geom_line() +
ggtitle("Median Presence of Spoken Word Attributes of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Speechiness")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_acousticness = median(acousticness)) %>%
ggplot(aes(x = year, y = median_acousticness)) +
geom_line() +
ggtitle("Median Acousticness of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Acousticness")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_instrumentalness = median(instrumentalness)) %>%
ggplot(aes(x = year, y = median_instrumentalness)) +
geom_line() +
ggtitle("Median Lack of Vocals of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Instrumentalness")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_danceability = median(danceability)) %>%
ggplot(aes(x = year, y = median_danceability)) +
geom_line() +
ggtitle("Median Danceability of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Danceability")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_energy = median(energy)) %>%
ggplot(aes(x = year, y = median_energy)) +
geom_line() +
ggtitle("Median Intensity of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Energy")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_duration_min = median(duration_min)) %>%
ggplot(aes(x = year, y = median_duration_min)) +
geom_line() +
ggtitle("Median Length in Minutes of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Duration in Minutes")
dfRock %>%
filter(year %in% c(1999:2019)) %>%
group_by(year) %>%
summarize(median_loudness = median(loudness)) %>%
ggplot(aes(x = year, y = median_loudness)) +
geom_line() +
ggtitle("Median Loudness of Popular Rock Songs Over Time") +
labs(x = "Year", y = "Median Loudness")
We can see that although there is a lot of fluctuation in the median liveness, there is not a strong upwards or downwards trend. The tempo of the most popular rock songs appears to have a slight upwards trend. This matches what we saw earlier across all genres of music. There is a lot of fluctuation in median valence, but we do see a dip downwards during the most recent years. There is also a lot of fluctuation for speechiness, but note from the y-axis that it is always at a low level. Acousticness is on the rise in recent years, closely matching the trend across all genres of music. Instrumentalness is overall very low, just as it was across all genres. Danceability shows an upwards trend as it did across all genres, but for rock there is a big dip downwards in the most recent years. The energy, or intensity, of the songs has trended downwards as it did across all genres. The length of rock songs does not show the strong downwards trend seen across all genres. The loudness of rock songs has also trended upwards and then back downwards, similar to the pattern across all genres.
We believe these observations can be used to help understand what is currently popular in the rock genre.
The goal of this section is to build a model that will predict if a rock song is going to be popular based on the values of some of the song metrics contained in the data.
Before building a model we will explore the correlation of the variables in the rock genre data.
#Look at how all the variables in Rock are correlated
pairs(dfRock[,c("popularity", "danceability", "energy", "loudness", "speechiness", "instrumentalness", "liveness", "valence", "tempo", "duration_min")])
We did not see any good linear correlation between popularity and any of the other variables. Even so, we created predictive models to see if we would be able to make predictions with any level of accuracy.
To prepare for predicting whether a rock song will be popular based on some of its metrics, we created a binary variable, popular, that takes a value of 1 if the popularity rating is greater than 68 and a value of 0 otherwise. The popularity rating of 68 was chosen because that is approximately the median popularity value of rock songs in the data set (68.5). The predictor variables were chosen based on what we thought fans of rock music would care about in their music.
#Adding the popular column to the data set. This variable takes on a value of 1 if the popularity of the song is greater than 68 and 0 otherwise. We also make sure this variable was of mode factor.
dfRock$popular <- ifelse(dfRock$popularity > 68, 1, 0)
dfRock$popular <- as.factor(dfRock$popular)
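Since the cutoff was chosen near the median, the two classes should be roughly balanced, which can be confirmed with table(). The sketch below uses a short hypothetical vector of popularity ratings to show the check, and also shows that a rating of exactly 68 falls in the 0 class because the comparison is strict.

```r
# Hypothetical popularity ratings for illustration.
popularity <- c(55, 61, 68, 68, 69, 72, 76, 89)

# Same rule as above: 1 if strictly greater than 68, otherwise 0.
popular <- factor(ifelse(popularity > 68, 1, 0), levels = c(0, 1))

class_counts <- table(popular)
class_counts              # number of songs in each class
prop.table(class_counts)  # share of songs in each class
```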
The predictive model will be created using a support vector machine. First we partitioned the data into training and testing sets.
#70% of the data will be used for training the model and the remaining 30% will be used to test the model.
trainList <- createDataPartition(y = dfRock$popular, p = 0.7, list = FALSE)
trainData <- dfRock[trainList,]
testData <- dfRock[-trainList,]
The model will be tuned using k-fold cross-validation with k = 10.
#Setting the trainControl() so it will use k-fold cross validation.
trctrl <- trainControl(method = "repeatedcv", number = 10)
#Building the model using predictors we felt rock fans would care about.
kfoldRock <- train(popular ~ energy + acousticness + valence + tempo, data = trainData, method = "svmRadial", trControl = trctrl, preProcess = c("center", "scale"))
kfoldRock
## Support Vector Machines with Radial Basis Function Kernel
##
## 150 samples
## 4 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (4), scaled (4)
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 134, 135, 135, 135, 135, 135, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.4462500 -0.09023225
## 0.50 0.5010119 0.01454727
## 1.00 0.5343452 0.07359066
##
## Tuning parameter 'sigma' was held constant at a value of 0.4751436
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.4751436 and C = 1.
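The resampling line in the output above reads "10 fold, repeated 1 times", so with the default number of repeats the "repeatedcv" method behaves like a single round of 10-fold cross-validation. The sketch below shows how the trainControl() call could request repeated cross-validation explicitly. The value repeats = 5 is an illustrative assumption, not a setting used in this report.

```r
library(caret)

# Plain 10-fold cross-validation: each observation is held out once.
ctrl_plain <- trainControl(method = "cv", number = 10)

# Repeated 10-fold cross-validation: the 10-fold split is redrawn 5 times
# (repeats = 5 is an illustrative choice, not a value from the report).
ctrl_repeated <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

ctrl_repeated$method
ctrl_repeated$repeats
```

With only 150 training samples, repeating the split may give steadier accuracy estimates across the tuning values of C, at the cost of a longer run time.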
#Testing and viewing the model's accuracy.
predictValues <- predict(kfoldRock, newdata = testData)
confusionMatrix(predictValues, testData$popular)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 11
## 1 16 21
##
## Accuracy : 0.5781
## 95% CI : (0.4482, 0.7006)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 0.1302
##
## Kappa : 0.1562
##
## Mcnemar's Test P-Value : 0.4414
##
## Sensitivity : 0.5000
## Specificity : 0.6562
## Pos Pred Value : 0.5926
## Neg Pred Value : 0.5676
## Prevalence : 0.5000
## Detection Rate : 0.2500
## Detection Prevalence : 0.4219
## Balanced Accuracy : 0.5781
##
## 'Positive' Class : 0
##
Looking at the confusion matrix, we can see that this model does not do a good job of predicting popularity. Its accuracy of 57.8% is only slightly better than the 50% no-information rate, and the difference is not statistically significant (p-value of 0.1302).
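The headline statistics in the confusion matrix output can be reproduced by hand from the four cell counts. The sketch below rebuilds the table and recomputes accuracy, sensitivity, specificity, and kappa, treating class 0 as the positive class as the output does. The final line uses a one-sided binomial test against the no-information rate, which is our understanding of how the reported P-Value [Acc > NIR] is obtained.

```r
# Cell counts copied from the confusion matrix above (filled column by column).
cm <- matrix(c(16, 16, 11, 21), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

n <- sum(cm)

accuracy    <- sum(diag(cm)) / n               # (16 + 21) / 64
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # 16 / 32, positive class is 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # 21 / 32

# Cohen's kappa: observed agreement compared with agreement expected by chance.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)

# One-sided test of whether accuracy exceeds the 50% no-information rate.
p_acc_gt_nir <- binom.test(sum(diag(cm)), n, p = 0.5,
                           alternative = "greater")$p.value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, kappa = kappa), 4)
```

The recomputed values match the reported accuracy of 0.5781, sensitivity of 0.5000, specificity of 0.6562, and kappa of 0.1562.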
Next we tried creating a logistic regression model to predict if a rock song would be popular.
#Creating the logistic model using the same variables as in the svm model. We set family to binomial() so the glm() function will use logistic regression.
RockGLM <- glm(popular ~ energy + acousticness + valence + tempo, family = binomial(), data = dfRock)
summary(RockGLM)
##
## Call:
## glm(formula = popular ~ energy + acousticness + valence + tempo,
## family = binomial(), data = dfRock)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.29583 -1.17118 -0.03227 1.17836 1.26172
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.226081 1.063938 -0.212 0.832
## energy -0.097843 1.158347 -0.084 0.933
## acousticness 0.398043 1.152113 0.345 0.730
## valence -0.170054 0.654404 -0.260 0.795
## tempo 0.002881 0.005016 0.574 0.566
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 296.67 on 213 degrees of freedom
## Residual deviance: 296.12 on 209 degrees of freedom
## AIC: 306.12
##
## Number of Fisher Scoring iterations: 3
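Again, our model is not successful at predicting whether a song will be popular. None of the coefficients are statistically significant (every p-value is above 0.5), and the residual deviance (296.12) is barely lower than the null deviance (296.67), so this model should not be used.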
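One way to judge the model as a whole is a likelihood ratio test that compares the residual deviance to the null deviance. The sketch below is an addition to the original analysis and uses only the numbers printed in the summary above; the object names are ours.

```r
# Deviances and degrees of freedom copied from summary(RockGLM).
nullDeviance <- 296.67    # intercept-only model, 213 degrees of freedom
residDeviance <- 296.12   # model with four predictors, 209 degrees of freedom
dfDiff <- 213 - 209

# The drop in deviance is compared to a chi-squared distribution with dfDiff degrees of freedom.
lrStat <- nullDeviance - residDeviance
lrPValue <- pchisq(lrStat, df = dfDiff, lower.tail = FALSE)
lrPValue
```

The p-value is roughly 0.97, so the four predictors together do not improve on a model with only an intercept.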
Modeling popularity may be a very difficult task because people’s preferences, and the changes in those preferences, are hard to predict. Instead, we took a look at the top 10 most popular rock songs released in each of the time intervals we used to break up the data previously.
Artists’ biographies on Spotify often include a “for fans of” section where they compare their music to better-known artists. A newly signed artist’s next song may have a better chance of becoming popular if the artist sounds like other bands that are currently popular.
#We filter the data like we did for the data containing all the genres so the data is broken up by the same time intervals.
dfRock2005 <- dfRock %>%
filter(year <= 2005) %>%
filter(year >= 2000)
dfRock2010 <- dfRock %>%
filter(year <= 2010) %>%
filter(year > 2005)
dfRock2015 <- dfRock %>%
filter(year <= 2015) %>%
filter(year > 2010)
dfRock2020 <- dfRock %>%
filter(year <= 2020) %>%
filter(year > 2015)
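The four filters above could also be written as a single binning step. The sketch below is one possible alternative and not the code used for this report; it assumes dplyr 1.0.0 or later for slice_max(), and the toySongs data and the period column are hypothetical.

```r
library(dplyr)

# Toy data with made-up values, not taken from the Spotify data set.
toySongs <- data.frame(
  song = c("A", "B", "C", "D", "E", "F"),
  year = c(1999, 2000, 2005, 2006, 2016, 2020),
  popularity = c(70, 65, 80, 75, 60, 90)
)

# cut() uses right-closed intervals by default, so (1999, 2005] covers 2000 to 2005,
# (2005, 2010] covers 2006 to 2010, and so on. Years outside the breaks become NA.
toyPeriods <- toySongs %>%
  mutate(period = cut(year, breaks = c(1999, 2005, 2010, 2015, 2020),
                      labels = c("2000-2005", "2006-2010", "2011-2015", "2016-2020"))) %>%
  filter(!is.na(period))

# Keeping the most popular song in each period. With the real data, n = 10 would give a top 10 list per period.
topByPeriod <- toyPeriods %>%
  group_by(period) %>%
  slice_max(popularity, n = 1, with_ties = FALSE) %>%
  ungroup()
topByPeriod
```

Keeping one data frame with a period column avoids repeating the same arrange and slice steps for each interval.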
#Now we arrange the data in order of decreasing popularity so the most popular songs are shown first. We are going to look at the top 10 rock songs in each time period so we use slice to include only the first 10 rows of data.
Top10Rock2005 <- dfRock2005 %>%
arrange(desc(popularity)) %>%
slice(1:10)
#For viewing simplicity, we will show only the artist name, song title, the popularity rating, the release year, and the duration of the songs in each top 10 list.
Top10Rock2005 %>%
select(artist, song, popularity, year, duration_min)
## # A tibble: 10 × 5
## artist song popularity year duration_min
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Linkin Park In the End 83 2000 3.61
## 2 Linkin Park Numb 81 2003 3.09
## 3 Red Hot Chili Peppers Can't Stop 80 2002 4.48
## 4 Coldplay Clocks 79 2002 5.13
## 5 Hoobastank The Reason 79 2003 3.88
## 6 Bon Jovi It's My Life 78 2000 3.74
## 7 3 Doors Down Kryptonite 78 2000 3.90
## 8 Jimmy Eat World The Middle 78 2001 2.76
## 9 Nickelback How You Remind Me 78 2001 3.73
## 10 Franz Ferdinand Take Me Out 77 2004 3.95
Top10Rock2010 <- dfRock2010 %>%
arrange(desc(popularity)) %>%
slice(1:10)
Top10Rock2010 %>%
select(artist, song, popularity, year, duration_min)
## # A tibble: 10 × 5
## artist song popularity year duration_min
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 My Chemical Romance Teenagers 80 2006 2.70
## 2 Coldplay Viva La Vida 80 2008 4.04
## 3 Kings of Leon Sex on Fire 80 2008 3.39
## 4 Red Hot Chili Peppers Snow (Hey Oh) 79 2006 5.58
## 5 Foo Fighters The Pretender 78 2007 4.49
## 6 The Offspring You're Gonna Go Far, Kid 78 2008 2.96
## 7 Owl City Fireflies 78 2009 3.81
## 8 Linkin Park What I've Done 77 2007 3.43
## 9 MGMT Kids 77 2007 5.05
## 10 Empire of the Sun Walking On A Dream 77 2008 3.31
Top10Rock2015 <- dfRock2015 %>%
arrange(desc(popularity)) %>%
slice(1:10)
Top10Rock2015 %>%
select(artist, song, popularity, year, duration_min)
## # A tibble: 10 × 5
## artist song popularity year duration_min
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 The Neighbourhood Sweater Weather 89 2013 4.01
## 2 The Neighbourhood Daddy Issues 85 2015 4.34
## 3 Arctic Monkeys Why'd You Only Call Me When … 84 2013 2.69
## 4 Arctic Monkeys Do I Wanna Know? 84 2013 4.54
## 5 Twenty One Pilots Stressed Out 83 2015 3.37
## 6 Coldplay Paradise 82 2011 4.65
## 7 Foster The People Pumped Up Kicks 82 2011 3.99
## 8 Coldplay Hymn for the Weekend 82 2015 4.30
## 9 Imagine Dragons Demons 81 2012 2.96
## 10 Coldplay A Sky Full of Stars 80 2014 4.46
Top10Rock2020 <- dfRock2020 %>%
arrange(desc(popularity)) %>%
slice(1:10)
Top10Rock2020 %>%
select(artist, song, popularity, year, duration_min)
## # A tibble: 10 × 5
## artist song popularity year duration_min
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 girl in red we fell in love in october 82 2018 3.07
## 2 Alec Benjamin Let Me Down Slowly 82 2018 2.82
## 3 MGMT Little Dark Age 81 2018 5.00
## 4 Twenty One Pilots Heathens 80 2016 3.27
## 5 Imagine Dragons Whatever It Takes 80 2017 3.35
## 6 Imagine Dragons Natural 80 2018 3.16
## 7 Panic! At The Disco High Hopes 80 2018 3.18
## 8 Dominic Fike 3 Nights 78 2018 2.96
## 9 The Strumbellas Spirits 71 2016 3.39
## 10 FINNEAS Let's Fall in Love for the… 71 2018 3.17
Based on current trends, we would want to sign a new artist that sounds like MGMT, Twenty One Pilots, and/or Imagine Dragons, as those are the most widely recognizable artists in the most recent top 10 list.
Although we were not successful in using the metrics to predict popularity, it is still helpful to know the metrics of the most recent popular songs. This can be used as a guide in the song creation process as a way to improve the odds of making a popular song.
#Summarizing the mean values of the variables that rock fans might care about the most in their music. These values are calculated using the songs with release dates between 2016 and 2020 so we can focus on the most recent trends we have data for.
dfRockSummary <- dfRock2020 %>%
summarize(avgLoudness = mean(loudness, na.rm = TRUE), avgDuration = mean(duration_min, na.rm = TRUE), avgTempo = mean(tempo, na.rm = TRUE), avgValence = mean(valence, na.rm = TRUE), avgEnergy = mean(energy, na.rm = TRUE), avgAcousticness = mean(acousticness, na.rm = TRUE))
dfRockSummary
## # A tibble: 1 × 6
## avgLoudness avgDuration avgTempo avgValence avgEnergy avgAcousticness
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -5.59 3.39 126. 0.520 0.686 0.165
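The same kind of summary can be written more compactly with across(), which applies one function to several columns at once. This is a sketch on made-up data, not the code used above; the toyRock data frame and the avg_ naming pattern are our own choices.

```r
library(dplyr)

# Toy data with made-up values; NA is included to show the effect of na.rm = TRUE.
toyRock <- data.frame(
  loudness = c(-5, -7, NA),
  tempo = c(120, 130, 110)
)

# across() computes the mean of every listed column, and .names controls the output column names.
toyMeans <- toyRock %>%
  summarize(across(c(loudness, tempo), ~ mean(.x, na.rm = TRUE), .names = "avg_{.col}"))
toyMeans
```

With this pattern, another metric can be included by adding its column name to the list inside across().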
The purpose of this project was to investigate the most popular music on Spotify in order to identify the kind of artist most likely to maximize the popularity of a new song on the platform and the profit returned from the investment in that artist. After investigating trends in the popularity of music over time, we discovered that listeners have begun to favor increased tempo, decreased valence, increased acousticness, increased danceability, decreased energy, and decreased duration. These trends are present across all genres. These conclusions were backed up with both traditional t-tests and Bayesian analyses. The most striking result was that the durations of the most popular songs in recent years have decreased by about half a minute compared to the popular music of the early 2000s.
Aside from the general trends discussed previously, the qualities of music that are most important to listeners vary greatly by genre, so we decided to see if one genre of music has been consistently more popular on average than the others. Through inferential testing we were able to conclude that, on average, rock music has been the most popular genre on Spotify. The mean popularity of songs in the rock genre is almost 3 points higher than the mean popularity of all other genres. This difference inspired us to focus on the rock genre moving forward.
Based on the most recent popular songs in the genre, rock listeners on Spotify seem to favor lower intensity and higher acousticness. While we were not able to predict popularity with good accuracy using a support vector machine or a logistic regression model, we were able to pinpoint the most popular recent songs and artists in the genre. We believe this information could be used in an artist description on Spotify where the artist is advertised as being for fans of these bigger-name artists. Also, by selecting an artist that will create songs with a similar overall sound to those top artists, we can take advantage of Spotify’s recommendation algorithms. This way, without spending additional money on advertising, our new artist’s songs will reach the ears of people listening to those most popular songs. That would maximize the number of people who hear the new songs, giving us many opportunities to convert listeners into fans.
Instead of using this investigation to produce a one-off recommendation, we would like to continue this research to create a renewable and sustainable business model. We believe it would be in the best interest of the company to fund the creation of software that can predict popularity, allowing us to sign more new artists while maximizing the number of successful new music releases. Since we would be consistently producing music with a very good chance of gaining traction through Spotify and reaching a very wide audience, we expect a return on investment within 3 years through the combined revenue sources of Spotify streams, album sales, merchandise sales, and concerts.